Accurate and Efficient Crawling for Relevant Websites
نویسندگان
چکیده
Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant webpages, there are various applications which target whole websites instead of single webpages. For example, companies are represented by websites, not by individual webpages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler to employ the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next and invokes internal crawlers. Each internal crawler views the webpages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single webpages.
منابع مشابه
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not an simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...
متن کاملEfficient Social Website Crawling Using Cluster Graph ; CU-CS-1056-09
Online social communities have gained significant popularity in recent years and have become an area of active research. Compared with general websites or well-structured Web forums, user-centered social websites pose several unique challenges for crawling, a fundamental task for data collection and data mining of large-scale online social communities: (1) Social websites have more complex link...
متن کاملEfficient Social Website Crawling Using Cluster Graph
Online social communities have gained significant popularity in recent years and have become an area of active research. Compared with general websites or well-structured Web forums, user-centered social websites pose several unique challenges for crawling, a fundamental task for data collection and data mining of large-scale online social communities: (1) Social websites have more complex link...
متن کاملSIGCSE: U: Focused Retrieval of University Course Descriptions from Highly Variable Sources
Finding topically relevant content from sparse disparate sources on the Web requires robust techniques. A focused web crawler is a type of crawler that attempts to make predictions about page relevance and traverse the web efficiently to retrieve relevant information. In this work, we design and test a novel framework of focused crawling tailored to extracting semantically relevant information ...
متن کاملProfile-Based Focused Crawling for Social Media-Sharing Websites
We present a novel profile-based focused crawling system for dealing with the increasingly popular social media-sharing websites. In this system, we treat the user profiles as ranking criteria for guiding the crawling process. Furthermore, we divide a user’s profile into two parts, an internal part, which comes from the user’s own contribution, and an external part, which comes from the user’s ...
متن کامل